Systematic review of diagnostic and prognostic host blood transcriptomic signatures of tuberculosis disease in people living with HIV

Background
HIV-associated tuberculosis (TB) has high mortality; however, current triage and prognostic tools offer poor sensitivity and specificity, respectively. We conducted a systematic review of diagnostic and prognostic host-blood transcriptomic signatures of TB in people living with HIV (PLHIV).

Methods
We systematically searched online databases for studies published in English between 1990 and 2020. Eligible studies included PLHIV of any age in test or validation cohorts, and used microbiological or composite reference standards for TB diagnosis. Inclusion was not restricted by setting or participant age. Study selection, quality appraisal using the QUADAS-2 tool, and data extraction were conducted independently by two reviewers. Thereafter, narrative synthesis of included studies, and comparison of signature performance, were performed.

Results
We screened 1,580 records and included 12 studies evaluating 31 host-blood transcriptomic signatures in 10 test or validation cohorts of PLHIV that differentiated individuals with TB from those with HIV alone, latent Mycobacterium tuberculosis infection, or other diseases (OD). Two cohorts (2/10; 20%) were prospective (29 TB cases; 51 OD) and eight (8/10; 80%) used a case-control design (353 TB cases; 606 controls). All cohorts (10/10) were recruited in Sub-Saharan Africa and 9/10 (90%) had a high risk of bias. Ten signatures (10/31; 32%) met minimum WHO Target Product Profile (TPP) criteria for TB triage tests. Only one study (1/12; 8%) evaluated prognostic performance of a transcriptomic signature for progression to TB in PLHIV, which did not meet the minimum WHO prognostic TPP.

Conclusions
Generalisability of reported findings is limited by few studies enrolling PLHIV, limited geographical diversity, and predominantly case-control design, which also introduces spectrum bias. New prospective cohort studies are needed that include PLHIV and are conducted in diverse settings. Further research exploring the effect of HIV clinical, virological, and immunological factors on diagnostic performance is necessary for development and implementation of TB transcriptomic signatures in PLHIV.




Introduction
There were an estimated 703,000 HIV-associated incident tuberculosis (TB) cases in 2021; however, only 368,600 (52%) were notified, with a resultant case fatality rate of 27% 1 . Earlier diagnosis and initiation of treatment, or disease prevention through targeted short-course TB preventive therapy (TPT), may reduce this burden. However, we lack adequate TB mass screening tools to direct confirmatory testing or prognostic tools to guide preventive therapy in the outpatient or community setting. Symptom screening, the most widely used TB triage tool, has low specificity in antiretroviral therapy (ART)-naïve and low sensitivity in ART-experienced people living with HIV (PLHIV) 2 . The addition of chest radiography improves sensitivity, at a cost of reduced specificity 2 . With almost three-quarters of the 38 million PLHIV globally now receiving ART 3 , new tools should be efficacious in this group.
The WHO currently recommends that PLHIV with a positive or unknown tuberculin skin test (TST) result should receive TPT, if active TB has been excluded 4 . There is strong evidence to support such an approach 5 . However, TST and interferon-γ release assay (IGRA) reflect a memory T-cell response following Mycobacterium tuberculosis (Mtb) exposure (sensitisation) and not necessarily ongoing infection. In TB-endemic countries with high rates of Mtb transmission and exposure, these tests have limited utility for guiding TPT 6,7 . In addition, loss or dysfunction of Mtb-specific memory T-cells among immunocompromised PLHIV with low CD4 cell counts results in lower IGRA positivity and may reduce sensitivity for predicting progression to disease 8, 9 . There is also limited evidence regarding repeat courses of TPT among PLHIV; a recent study demonstrated that universal retreatment after one year did not provide additional benefit 10 .
Biomarker-guided treatment has been proposed to target therapy to those who need it most, reducing unnecessary pill burden, drug interactions, and side effects in individuals, and increasing efficacy and cost-effectiveness of mass screening 11,12 . Host-response blood transcriptomic signatures can identify those with active TB and those who are progressing to disease 13-15 . Performance of most signatures in adults without HIV meets at least one of the minimum World Health Organization (WHO) Target Product Profile (TPP) TB triage test performance criteria (sensitivity 90% and specificity 70%) for diagnosing prevalent TB 16,17 . Several signatures have been shown to meet minimum prognostic benchmarks (sensitivity 75% and specificity 75%) 18 for short-term prediction of progression to TB disease within six months of testing 19-21 . However, only two signatures, Roe1 and Roe3 22,23 , meet these criteria through 12 months of follow-up for progression 21 . Transcriptomic biomarkers selected for advancement through the diagnostics pipeline for development as point-of-care assays should also perform well in PLHIV. We systematically reviewed the published literature on host-response blood transcriptomic biomarkers for diagnosing prevalent and predicting progression to incident TB disease in PLHIV, and compared performance to the WHO TPP criteria.

Protocol and registration
This review is reported in line with the Preferred Reporting Items for Systematic reviews and Meta-Analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) 24 recommendations (Table 1). The systematic review protocol was registered with the International Prospective Register of Systematic Reviews (PROSPERO) on 02 January 2021 with registration number CRD42021224155 and published in BMJ Open 25 .

Eligibility criteria
We considered cross-sectional and case-control studies, prospective and retrospective cohort studies, and randomised controlled trials evaluating diagnostic and/or prognostic performance of human host-blood transcriptomic signatures of TB (index tests). Eligible studies included PLHIV in the signature test and/or validation cohorts. Studies that only reported signature discovery cohort performance, or treatment response and failure monitoring cohorts, were not considered. PLHIV of all ages, ethnicities, and in all settings were considered. Studies were excluded if they did not report any measure of signature performance for PLHIV (sensitivity and specificity, or data enabling reconstruction of a two-by-two table for test accuracy calculation), did not clearly state the case definition of TB disease, did not report primary data, or did not independently report signature performance in PLHIV.

Endpoint definitions
The primary TB disease endpoint (target condition) was defined by a positive microbiological test, such as mycobacterial culture or the Xpert MTB/RIF assay (reference standards), in sputum or other bodily fluid sample. Microbiologically-confirmed extra-pulmonary TB disease was also considered. The secondary TB disease endpoint was defined by non-microbiologically-confirmed, presumptive TB diagnosed via composite clinical features. TB disease diagnosed within one month of the index test was presumed to be prevalent disease (diagnostic studies). Prognostic studies were defined as prospective studies in which participants were followed up for progression to incident TB disease with measurement of a transcriptomic signature from blood samples collected at enrolment. Eligible studies included healthy individuals, latent Mtb-infected individuals, or individuals with other respiratory or systemic diseases as a control group. Latent Mtb infection was defined by a positive TST or IGRA.

We systematically searched online databases for studies published in English between 1 January 1990 and 31 December 2020 using Medical Subject Headings (MeSH) and keyword search terms for "Diagnosis", "Messenger RNA", "Biomarkers/blood", "Tuberculosis", and "HIV". The search strategy, including publication date range, was prespecified and published in a systematic review protocol 25 . We reviewed reference lists of eligible articles and performed forward citation tracking using Science Citation Index (via Web of Science) to identify further articles and reports missed by the electronic database search 26 .

Study selection and data collection
Two reviewers (SCM and SV) independently conducted the literature search and screened the search outputs for potential inclusion using EndNote bibliographic software to manage references, as previously described 27 . After removal of duplicates, the selection process included an initial screening of titles and abstracts for relevance, followed by full text review for eligibility. The two reviewers resolved any disagreements or uncertainties by discussion. Data elements of included studies were then independently extracted by the two reviewers. Corresponding authors of potentially eligible studies were contacted to provide deidentified participant-level data to reconstruct two-by-two tables or summary performance data for the PLHIV subgroup. Studies without summary or participant-level data available for the PLHIV subgroup were excluded.

Data analysis
We performed a narrative synthesis of the eligible study cohorts and signatures, including study design, cohort and signature characteristics, and diagnostic and prognostic performance of signatures stratified by study control groups (healthy, latent-Mtb infected, or other disease), and diagnostic reference standards (microbiological or composite clinical). Studies and cohorts were designated by the first author name and year of publication (e.g. Author2019a) and signatures by first author and number of transcripts (e.g. Author11). Signature area under the curve (AUC), sensitivity, and specificity were summarised in forest plots (R forestplot package 28 ). For studies with available participant-level data for the PLHIV subgroup, we recalculated AUC (R pROC package 29 ), and benchmarked sensitivity and specificity against the WHO TPP minimum performance criteria for a triage (70% specificity and 90% sensitivity) 16 or prognostic (75% specificity and 75% sensitivity) 18 test. 95% confidence intervals for AUCs, and for sensitivity and specificity, were calculated using the DeLong 30 and Wilson binomial 31 methods, respectively. For studies in which participant-level data were not available, we report summary AUC, sensitivity, and specificity estimates, and 95% confidence intervals, for the PLHIV subgroup as published in the original papers. Most of these estimates were not specifically benchmarked against the WHO TPP minimum performance criteria for a triage test.
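The benchmarking step described above can be illustrated with a minimal Python sketch. It is not the authors' code (the review used R packages); the function names, the use of point estimates for the comparison, and the two-by-two counts in the usage note are assumptions made for illustration only. It computes sensitivity and specificity from a two-by-two table, a Wilson score interval for each, and a check against the minimum TPP thresholds quoted in the text.

```python
import math

# Minimum WHO TPP criteria quoted in the text: (sensitivity, specificity)
TPP_MINIMUM = {"triage": (0.90, 0.70), "prognostic": (0.75, 0.75)}


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


def accuracy(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from two-by-two counts."""
    return tp / (tp + fn), tn / (tn + fp)


def meets_tpp(tp, fp, fn, tn, use="triage"):
    """Assumption: point estimates are compared with the TPP minimum."""
    sensitivity, specificity = accuracy(tp, fp, fn, tn)
    min_sensitivity, min_specificity = TPP_MINIMUM[use]
    return sensitivity >= min_sensitivity and specificity >= min_specificity
```

For example, with hypothetical counts of 28 true positives, 2 false negatives, 50 true negatives, and 20 false positives, `meets_tpp(28, 20, 2, 50)` evaluates to true, while `wilson_interval(28, 30)` gives the interval around the sensitivity estimate. Whether a signature should be judged on its point estimate or on its lower confidence bound is a design choice the sketch does not settle.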

Risk of bias, applicability, and quality of evidence
The methodological quality and applicability concerns of included studies were assessed by the two reviewers using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool 32 and graphically represented using traffic-light plots (R robvis package 33 ). Risk of bias and applicability concerns for each study were evaluated in four domains relating to (1) patient selection, (2) measurement of the index test, (3) measurement of the reference standard, and (4) study flow and timing of investigational and diagnostic procedures. Risks of bias are reported within each domain of each study as low risk (no risks of bias identified), some concerns (one risk or unclear risk of bias identified), or high risk (more than one risk or unclear risk of bias identified). Overall risk of bias for each study was reported as low risk (no risks of bias identified in any domains), some concerns (some concerns identified in one domain), or high risk (some concerns identified in two or more domains or high risk of bias in any domain). We assessed the cumulative quality of evidence synthesised by the systematic review using the "Grading of Recommendations Assessment, Development and Evaluation" (GRADE) approach 34 with classification based on study design and limitations, indirectness, inconsistency, imprecision, and publication bias 35,36 .
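The domain-level and overall rating rules stated above can be written as a short decision procedure. The following Python sketch is one reading of those rules, not the reviewers' own tooling; the function names and the string labels are hypothetical.

```python
def domain_rating(n_risks):
    """Rate one QUADAS-2 domain from the number of risks (or unclear risks) identified."""
    if n_risks < 0:
        raise ValueError("n_risks must be non-negative")
    if n_risks == 0:
        return "low"
    if n_risks == 1:
        return "some concerns"
    return "high"


def overall_rating(domain_ratings):
    """Combine the per-domain ratings into an overall risk of bias for a study."""
    if "high" in domain_ratings:
        return "high"
    n_concerns = domain_ratings.count("some concerns")
    if n_concerns >= 2:
        return "high"
    if n_concerns == 1:
        return "some concerns"
    return "low"
```

Under this reading, a study with a single domain rated "some concerns" and the other three rated "low" is reported as "some concerns" overall, whereas two such domains, or any "high" domain, make the overall rating "high".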

Search results
We performed the literature search in January 2021, identifying 1,580 unique records published between 1 January 1990 and 31 December 2020, of which 98 full-text articles were assessed for eligibility, and 12 studies 13,22,37-47 met all criteria for inclusion (Figure 1). The main reasons for study exclusion were absence of PLHIV in study cohorts (n=29), absence of an independent test or external validation cohort which included PLHIV (n=20), and inappropriate index test (n=16) or study design (n=10). Nine of 10 studies excluded for inappropriate design were commentaries or reviews. In addition, deidentified participant-level data to reconstruct two-by-two tables, or summary performance data for PLHIV, were not available and no data were received from corresponding authors for 9 records 19,48-55 .

Study cohorts included in quantitative synthesis
The 12 eligible studies included 10 independent test or validation cohorts featuring PLHIV (Table 2), cumulatively evaluating diagnostic performance of 31 transcriptomic signatures (Table 3) incorporating over 700 unique transcripts. All independent test and validation cohorts enrolled PLHIV from outpatient clinics or hospital inpatients, most with suspicion (symptom-positive) or high risk (initiating antiretroviral therapy) of prevalent TB (  38 included 17 HIV-infected children with non-microbiologically-confirmed, presumptive clinical TB with clinical and radiologic features that prompted empirical treatment. All other study cohorts, with the exception of Anderson2014, enrolled adults and used a microbiological reference standard (culture or Xpert MTB/RIF).
The 40 most frequent transcripts, all included in 3 or more signatures, are listed in Table 4. We used the INTERFEROME database 57 to classify interferon-stimulated genes (ISGs): we defined ISGs as genes significantly up- or down-regulated in expression (>1.5 fold change) in any human samples treated with Type-I IFN, relative to control samples. Almost all (38/39) of the most common transcripts with available gene annotations were classified as ISGs. The six transcripts most frequently included in signatures were Guanylate Binding Protein (GBP) 5 (11 signatures), GBP6 (10 signatures), Complement C1q B Chain (C1QB) and Fc fragment of IgG receptor Ia (FCGR1A) (7 signatures each), and Basic Leucine Zipper ATF-Like Transcription Factor 2 (BATF2) and GBP2 (6 signatures each).
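The ISG definition above amounts to a threshold rule on fold change. The Python sketch below illustrates it under stated assumptions: fold changes are expressed as treated-to-control ratios, down-regulation by more than 1.5-fold is taken to mean a ratio below 1/1.5, and statistical significance is not modelled. The gene labels and values are hypothetical and are not INTERFEROME data.

```python
def is_isg(fold_changes, threshold=1.5):
    """True if any Type-I IFN-treated sample shows a >threshold-fold change in either direction.

    fold_changes: treated/control expression ratios (hypothetical input format).
    """
    return any(fc > threshold or fc < 1 / threshold for fc in fold_changes)


# Hypothetical fold changes for illustration only
example = {
    "GENE_A": [1.1, 2.3],  # up-regulated in one sample
    "GENE_B": [0.9, 1.2],  # within the threshold in all samples
    "GENE_C": [0.5, 1.0],  # down-regulated in one sample
}
isg_calls = {gene: is_isg(ratios) for gene, ratios in example.items()}
```

Because the rule fires when any single sample crosses the threshold, it is permissive, which is consistent with nearly all of the most common transcripts being classified as ISGs.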

Quality appraisal of eligible studies
In the patient selection domain, 5 of the 10 independent test and external validation cohorts consecutively or randomly enrolled participants, with the remaining cohorts not reporting sampling method (Figure 2). Eight of the 10 cohorts utilised a case-control design (Table 2), with exclusion of participants with uncertain diagnosis introducing a high risk of spectrum bias and potentially inflating diagnostic accuracy. The Penn-Nicholson2019b cohort 45 used a prospective design in recruiting symptomatic clinic attendees, but excluded probable and uncertain TB cases from analysis. Only one prospective diagnostic accuracy study, Turner2020 13 , measured signature scores and tested performance in all enrolled participants with clinically suspected tuberculosis, including those with uncertain diagnosis, representative of the target population of symptomatic clinic attendees.
In the index test measurement domain, transcriptomic signature scores were interpreted without knowledge of the reference standard (i.e. blinded) in the Turner2020 13 and Penn-Nicholson2019b 45 cohorts, with unclear reporting for the other studies. Due to the early stage of biomarker development and diverse signature measurement and score calculation methodologies, no studies used pre-specified signature score thresholds. The risk of bias in the reference standard domain was deemed to be low in all cohorts with use of appropriate and standardised microbiological confirmatory TB testing (Mtb culture and Xpert MTB/RIF) likely to correctly classify the target condition. Reference standard results were interpreted without knowledge of the results of the transcriptomic signature scores (i.e. blinded) for all included studies. In the study flow and timing domain, all studies used an appropriate interval between index test and reference standard sample collection, and all participants received the same reference standard tests. However, only the Turner2020 13 study included all participants in analysis. In terms of applicability concerns, selection of participants, and measurement and interpretation of index tests and reference standard matched the review question for all included studies. Overall 9 of the 10 cohorts had a high risk of bias, and one study (Turner2020 13 ) had some concerns due to lack of a pre-specified test threshold prospectively applied to each signature (Figure 2).

Transcriptomic signature diagnostic performance
Nine independent test or external validation diagnostic cohorts with PLHIV subgroups were included in the systematic review (Table 2).
Three cohorts evaluated diagnostic performance of 12 signatures for discriminating HIV-infected adults with prevalent TB disease from latent-Mtb infected individuals or healthy controls with HIV (Figure 3), and 6 cohorts evaluated diagnostic performance of 29 signatures for discriminating HIV-infected adults or children with prevalent TB disease from those with HIV and other respiratory or systemic diseases (Figure 4).
Only the Kaforou27 (Kaforou2013a and Rajan2018 cohorts), Penn-Nicholson6 (Penn-Nicholson2020a cohort), and Rajan5 (Rajan2018 cohort) signatures met the WHO TPP minimum performance criteria for a triage test for differentiating adults with prevalent TB disease from latent-Mtb infected individuals or healthy controls with HIV in these case-control studies (Figure 3). Participant-level data were available for only 3 signatures in the Anderson2014b cohort, which also included 17 children with non-microbiologically confirmed, presumptive clinically-diagnosed TB disease (Table 2). All signatures performed poorly in differentiating TB from other diseases in this subset (Figure 5). There was considerable clinical and methodological heterogeneity between cohorts, with participants recruited in diverse settings, with dissimilar eligibility criteria, and distinct composition of the control groups. Studies also used different signature measurement methods (microarray, RNA sequencing, and RT-qPCR), different methods of signature score calculation, and there were no standardised signature score thresholds. Due to the significant heterogeneity in study design and index test measurement, and limited participant data, a meta-analysis was not deemed appropriate.

Transcriptomic signature prognostic performance
Only one study (Darboe2019) 42 evaluated transcriptomic signature prognostic performance for recurrent TB disease in adults with HIV who had recently completed TB therapy. There were no eligible studies evaluating performance for prediction of progression to incident TB disease in PLHIV without prior TB disease. We stratified the prognostic performance of the Darboe11 signature by time from measurement until TB disease recurrence (Figure 6). Signature prognostic performance was best within 90 days of signature measurement, and waned thereafter. Sensitivity and specificity of the Darboe11 signature did not meet the minimum WHO TPP performance criteria for a prognostic test (75% sensitivity and 75% specificity) 18 in any time window. Sample size was not sufficient to perform subgroup prognostic performance analyses; however, Darboe11 signature scores were higher in individuals with detectable plasma HIV viral load (>400 copies/mL) as compared to those with an undetectable plasma viral load (<400 copies/mL; p<0.0001) 42 .

GRADE evidence summary
A total of 10 cohorts, 2 prospective cohorts (29 TB cases; 51 other respiratory diseases) and 8 case-control studies (353 TB cases; 606 controls), were included in this systematic review of the diagnostic and prognostic accuracy of host blood transcriptomic signatures in PLHIV. All studies used reliable reference standards for definitive TB diagnosis. However, we adjudged that there was a very serious risk of bias due to exclusion of participants with indeterminate (non-microbiologically confirmed) TB in numerous studies, removing diagnostic uncertainty, and resulting in reduced diversity of clinical TB disease. Several case-control studies also included healthy asymptomatic controls, healthy Mtb-sensitised (latent Mtb infected) individuals (IGRA or TST positive), or individuals with other uncommon diseases, not reflective of the target population or setting, further exacerbating the spectrum bias. The inclusion of severe TB cases and healthy controls (or controls with inappropriate other diseases) may have resulted in misleadingly high diagnostic accuracy in some of these studies. Other limitations included uncertainty regarding consecutive recruitment, blinding status not clearly stated, and lack of a priori score thresholds.
Indirectness is synonymous with applicability, generalisability, translatability, and external validity of the evidence 36 . Included studies evaluated diagnostic performance among adults or children, within clinical outpatient and hospital inpatient settings, prospectively among symptomatic clinic attendees or within matched case-control cohorts. Some of these settings and populations are not appropriate or relevant to clinical practice, and results are unlikely to be generalisable. The lack of diagnostic uncertainty, spectrum bias, and inappropriate control selection are concerns for external validity of these results. Point-of-care device translatability has only been tested for one signature (Sweeney3) among PLHIV, with unsatisfactory diagnostic accuracy 46 . Technical variability and operator reliability have not been tested on point-of-care platforms for any tests.
With regards to downstream effects, false negative signature results (patients incorrectly classified as not having TB) may have serious consequences, with delayed TB diagnosis resulting in Mtb transmission to close contacts, and increased risk of morbidity and mortality. While consequences are less serious, false positive results (individuals incorrectly classified as having TB) may result in costly further investigations, or 6 months of curative therapy with potential adverse effects and without apparent benefit. Incorrect diagnosis of TB may result in missed or delayed alternate diagnosis and treatment, with potential downstream consequences. Misdiagnosis of TB may also result in stigmatisation from family and community, and psychological distress. There is no uncertainty regarding true positive and true negative results.
We found very serious risks of inconsistency, with significant unexplained heterogeneity of diagnostic sensitivity and specificity estimates for signatures in different validation cohorts and settings. No signatures consistently met the WHO TPP criteria in all or most cohorts, suggesting publication bias toward more optimistic signature performance, particularly in discovery cohorts. There was also a very serious risk of imprecision, with small sample sizes and wide confidence intervals for estimates of test accuracy among PLHIV, with no pooling of data. Data were not available or accessible for the PLHIV subgroup in numerous cohorts. In summary, the data included in this review provide very low quality evidence and we would not recommend any changes to clinical practice based on these results.

Discussion
TB transcriptomic biomarkers selected for advancement through the diagnostics pipeline for further development as point-of-care tests should ideally perform well in PLHIV. We systematically searched online databases for studies which evaluated the performance of host-blood transcriptomic signatures for diagnosing prevalent TB and identifying those who are progressing to incident TB in PLHIV, and compared performance to the WHO TPP criteria. We found 12 studies published prior to 2021 which included 10 independent test or validation cohorts featuring PLHIV, evaluating 31 transcriptomic signatures. Several of the signatures approached or met the WHO TPP minimum performance criteria for a triage test for differentiating people with prevalent TB disease from latent-Mtb infected individuals, healthy controls, or individuals with other respiratory or systemic diseases 16 . However, no transcriptomic signatures met the TPP benchmark criteria for a non-sputum confirmatory diagnostic test among PLHIV. The signatures also performed poorly for diagnosing non-microbiologically confirmed, presumptive TB disease.
Only one cohort evaluated a signature for predicting TB disease recurrence in individuals who recently completed TB treatment and initiated ART. Prognostic performance appeared to be superior proximally to incident TB disease, with highest AUC and specificity in the 3 months preceding TB recurrence 42 . The Darboe11 signature did not meet the WHO TPP for a prognostic test in any time window in this population 18 .
Among the 31 signatures evaluated, we found that the genes most frequently incorporated in TB transcriptomic signatures were ISGs, which may also be upregulated by chronic HIV viraemia 62,63 . We hypothesised that these signatures would be less discriminatory for TB in PLHIV, particularly among viraemic ART-naïve individuals, due to an increased abundance of circulating type-I IFNs. While no studies performed subgroup analyses of diagnostic accuracy by HIV plasma viral load, Darboe and colleagues 42 reported higher signature scores in individuals with a viral load greater than 400 copies per mL, as compared to those with an undetectable viral load (<400 copies/mL). Södersten and colleagues 46 demonstrated decreased Sweeney3 specificity in adults with a CD4 cell count less than 200, as compared to those with CD4 cell count greater than 200. Low CD4 cell count is a proxy for ART-naivety and high HIV plasma viral load. The lower specificity is possibly due to higher signature scores in the control group, either due to HIV viraemia, undiagnosed early or minimal TB, or other opportunistic infections. By specifically excluding ISGs, Esmail and colleagues 62 have demonstrated that classical complement pathway and Fc-γ receptor 1 (FCGR1) genes are also differentially expressed in individuals with subclinical HIV-associated TB. While traditional discriminant analysis yields an overabundance of ISGs and an underabundance of B- and T-cell genes in active TB patients versus latently Mtb-infected controls, Singhania and colleagues 53 have shown that a modular approach (i.e. pre-filtering genes by functional modules) results in a more diverse gene set.
The synthesised systematic review results represent an overall low quality of evidence, with lack of generalisability and external validity, inconsistency in results between studies, and imprecision in estimates. Most of the discovery and validation studies were conducted in Africa, particularly South Africa, limiting geographic diversity and generalisability of results. There were also few training and test datasets including PLHIV, with a notable overreliance on the Kaforou2013 and Anderson2014 datasets for signature discovery and validation, further limiting generalisability. Also, all eligible validation cohorts were from outpatient or inpatient settings. It is notable that signatures generally performed best in small test sets derived from the same population as the signature training cohort (e.g. Rajan5 in Rajan2018 test cohort, Kaforou27 in Kaforou2013a test cohort, Kaforou44 in Kaforou2013b, Anderson51 in Anderson2014b, and Penn-Nicholson6 in Penn-Nicholson2020a cohort), and performance waned in subsequent external validation. The differences in signature performance between cohorts, with signatures meeting WHO TPP criteria in one cohort but not others, are also likely attributable to differences in discovery and validation cohort designs, and publication bias. Multicohort gene meta-analytical methods, similar to those employed by Sweeney and colleagues 49 , may help to overcome such limitations, and result in greater reproducibility of performance across cohorts. Strengths of this systematic review include the comprehensive search strategy, with rigorous eligibility criteria, and publication of a peer-reviewed study protocol. The review also had several limitations. The pre-specified literature search strategy only included studies published prior to 2021; there were few studies with data available for a PLHIV subgroup in test or validation sets in this period, and most included cohorts were small and underpowered.
The use of published summary performance data is also problematic due to different statistical methods used in original papers. We were also unable to obtain summary diagnostic performance estimates, or participant-level data to reconstruct two-by-two tables, for PLHIV subgroups from 9 studies, despite contacting corresponding authors. Most papers did not conform to the standard reporting guidelines for diagnostic accuracy studies (STARD) 66 , with missing information particularly relating to study design, participant recruitment, and blinding to reference standard result. Requisite anonymised participant data with signature scores were only available for a handful of studies. The lack of participant-level metadata precluded subgroup analyses. We were thus unable to systematically determine the effect of HIV viral load, CD4 cell count, TPT, and ART on transcriptomic signature diagnostic performance. Meta-analysis was also deemed inappropriate due to the clinical and methodological heterogeneity and high risk of bias in cohort design and signature measurement method, with no predetermined score cut-offs or methods of standardising scores across platforms.
In the two years subsequent to the completion of this systematic literature review, diagnostic and prognostic performance of several transcriptomic signatures measured by qPCR were prospectively tested for mass screening in a South African community setting amongst predominantly asymptomatic PLHIV who were not seeking care 21 . Additionally, most of the cohorts evaluated in this systematic review included adults, with only the Anderson2014 paediatric cohort eligible for inclusion. Paediatric TB is particularly difficult to diagnose due to its paucibacillary nature and difficulty in obtaining sputum samples from small children 73-76 . An accurate non-sputum diagnostic, such as host response transcriptomic signatures, would transform the diagnosis of TB in children 77 . However, young children are frequent vectors for respiratory and gastrointestinal viruses, and specificity of transcriptomic signatures may be low due to induction of ISG signalling from viral infections.
Transcriptomic signatures show promise for screening for prevalent TB to guide further investigations and for predicting progression to incident TB to target TB preventive therapy 17. However, evidence among PLHIV is limited and mostly from small case-control studies with high risk of spectrum bias. This review emphasises the need for larger heterogeneous prospective discovery and validation cohorts exclusively consisting of PLHIV, or mixed cohorts which include PLHIV. Such cohorts should ideally test biomarker performance side-by-side to determine which signature should be advanced through the developmental pipeline. Further research exploring the effect of HIV clinical, virological, and immunological status is necessary for the design and implementation of TB transcriptomic signatures in this population, who are at heightened risk of TB and its sequelae.

Data availability statement
All data underlying the results are available in the original published manuscripts or on request from corresponding authors.
No original data are associated with this article.

Open Peer Review
It is indeed a very important topic as there is a great need for accurate diagnosis of TB in HIV background, so that appropriate treatment can be provided, and the spread of TB can be checked. The need for the analysis is well brought out in the review.
The authors have conducted a systematic survey for relevant transcriptomic data reported over a period of 30 years and use well-defined criteria for inclusion and exclusion of the signatures, and list clear objectives and evaluation metrics and also quality assessment of the diagnostic accuracy. The manuscript is overall well written. It would be helpful if the authors could elaborate on the following points, so as to make it more accessible to a wider audience.
1. A brief note could be added on the WHO TPP minimum performance criteria, and why some datasets did not qualify, especially to bring in the difference between 'lack of required data/comparisons' in the original study versus 'signatures not performing well enough'.
2. A brief introduction on the risk of bias and the various domains under which it is assessed and how it is addressed.
3. The statement "Overall 9 of the 10 cohorts had a high risk of bias, and one study (Turner2020) had some concerns" could be followed by some recommendation on how to read meaning from these studies. It would also help if the term 'concerns' were explained and placed in the risk spectrum.
4. As some studies include more than one sub-objective of diagnosis/prognosis, it would help if a note is added in discussion on the diagnostic accuracy/potential for each comparison axis separately.
Are the rationale for, and objectives of, the Systematic Review clearly stated? Yes

Is the statistical analysis and its interpretation appropriate? Yes
Are the conclusions drawn adequately supported by the results presented in the review? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Systems Biology, Tuberculosis, Genomics and Bioinformatics, Diagnostic biomarkers
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Author Response 29 Apr 2023

Simon Mendelsohn
Thank you for your thoughtful review of our manuscript. We appreciate your positive feedback and suggestions for improvement. Please find our response to each of your points below:
1. We include the following note on the WHO TPP minimum performance criteria in the introduction: "… the minimum World Health Organization (WHO) Target Product Profile (TPP) TB triage test performance criteria (sensitivity 90% and specificity 70%) for diagnosing prevalent TB" and "…minimum prognostic benchmarks (sensitivity 75% and specificity 75%)".
Regarding why "some datasets did not qualify -especially to bring in the difference between 'lack of required data/comparisons'": per published systematic review protocol eligibility criteria (see Mendelsohn, BMJ Open, 2021), we excluded datasets which did not include PLHIV, or where it was not possible to stratify results by HIV status, where only signature discovery performance was reported (no independent validation), or where the controls or cases were not adequately defined. We also excluded studies where we were not able to retrieve participant level data, or calculated signature scores.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Author Response 29 Apr 2023

Simon Mendelsohn
Thank you very much for taking the time to carefully review our study and for your constructive feedback. We appreciate your positive comments regarding the clarity of our writing and the description of our methods.
1. Regarding your first suggestion, we agree that we could have been clearer about how we arrived at the overall risk of bias for each study in Figure 2. We will revise the "Risk of bias, applicability, and quality of evidence" Methods section to provide clarity.
2. Regarding your second comment, we tried to avoid overlap between the "Patient selection" and "Study flow and timing" domains in QUADAS-2. For the "Patient selection" domain, we specifically asked "Did the study avoid inappropriate exclusions?" per the eligibility criteria (i.e. were the eligibility criteria appropriate for the study design). For example, did the study inappropriately exclude individuals with comorbidities, such as diabetes, or individuals who use drugs?
For the "Study flow and timing" domain, we asked whether "…all [eligible] patients [were] included in the analysis?" (i.e. was there inappropriate exclusion of eligible/enrolled participants from analysis due to, for example, inconclusive diagnostic test results/diagnostic uncertainty or loss to follow up). However, we acknowledge that there is some subjectivity in this process.
3. Regarding your third suggestion, to expand on the performance of signatures across multiple cohorts: we did not identify any signatures that consistently met the WHO TPP criteria in all or most cohorts. We agree that this may suggest bias toward more optimistic performance. It is notable that signatures generally performed best in small test sets derived from the same population as the signature training cohort (e.g. Rajan5 in Rajan2018 test cohort, Kaforou27 in Kaforou2013a test cohort, Kaforou44 in Kaforou2013b, Anderson51 in Anderson2014b, and Penn-Nicholson6 in Penn-Nicholson2020a cohort), and performance waned in subsequent external validation. The differences in signature performance between cohorts, with signatures meeting WHO TPP criteria in one cohort but not others, are also likely attributable to differences in discovery and validation cohort designs. Multicohort gene meta-analytical discovery and validation methods, similar to those employed by Sweeney and colleagues, may help to overcome such limitations.
We will expand on the above in the Results and Discussion sections.
4. Regarding your final suggestion, we agree that large, heterogeneous prospective cohorts that include PLHIV are the gold standard for discovering and validating host blood transcriptomic signatures for TB diagnosis and risk prediction. We also acknowledge that such studies are expensive. This review suggests that transcriptomic/gene signatures discovered in smaller studies are unlikely to be reproducible across multiple distinct cohorts. We agree that applying multicohort gene meta-analytical methods to existing microarray or sequencing data is a more pragmatic use of resources, and may result in greater reproducibility of performance across new validation cohorts. However, we argue that there are insufficient existing microarray or sequencing data among people living with HIV, hence the need for large prospective cohorts in this population.
Thank you again for your thoughtful review and helpful suggestions. We will address these points in our revised manuscript.

Competing Interests:
No competing interests were disclosed.